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Abstract 

Motivated by genetic association studies of SNPs with genotype uncertainty, 
we propose a generalization of the Kruskal-Wallis test that incorporates group 
uncertainty when comparing k samples. The extended test statistic is based on 
probability-weighted rank-sums and follows an asymptotic chi-square distribu- 
tion with k — 1 degrees of freedom under the null hypothesis. Simulation studies 
confirm the validity and robustness of the proposed test in finite samples. Ap- 
plication to a genome-wide association study of type 1 diabetic complications 
further demonstrates the utilities of this generalized Kruskal-Wallis test for 
studies with group uncertainty. 
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1 Introduction 



The seminal work by Kruskal and Wallis (1952) provided us a robust rank-based test 
for the fc-sample problem, complementing the parametric approaches such as the one- 
way analysis of variance (ANOVA). In the classical fc-sample problem, data are well 
classified into different categories or groups. However, in many current scientific stud- 
ies the categorical variables are not necessarily deterministic, and the uncertainties 
are quantitatively expressed by probability distributions over attributes. Such classi- 
fication problems often arise in biomedical and bioinformatics applications where data 
mining techniques and classification algorithms are used to obtain class membership 
probabilities. 

A particular motivating example of this work is the genetic association study of sin- 
gle nucleotide polymorphisms (SNPs) for which the genotype group assignments are 
often known with ambiguity. The data uncertainties at these SNPs are typically rep- 
resented by genotype probabilities obtained from various genotype calling algorithms 
(e.g. Carvalho et al., 2010) or imputation algorithms (e.g. Li et al., 2009). Table 1 
provides a hypothetical illustration. In such number of parametric remedies 

have been proposed, including the popular dosage approach, the weighted regres- 
sion method (Aulchenko et al., 2010) and likelihood-based score tests (Schaid et al., 
2002). Although these parametric approaches are satisfactory in many applications, 
investigators often seek complimentary evidence provided by robust non-parametric 
alternatives, safeguarding their statistical analyses against potential model misspec- 
ifications. Therefore, it is of both theoretical and practical importance to generalize 
the Kruskal-Wallis test so that it is applicable to the /c-sample problem but with 
group uncertainty. 

To formulate the testing problem, let Y be a continuous response variable and G 
a categorical variable with k distinct attributes. For instance, in genetic association 
studies, Y denotes the phenotype of interest (e.g. blood pressure or glucose level) 
and G is the genotype variable at a particular SNP with k = 3. The three categories 
for a SNP represent if an individual's genotype at this SNP contain 0, 1 or 2 copies of 
the minor allele (one of the two alleles with population frequency < 0.5). The goal of 
the association analysis between Y and G is to determine if the phenotype Y values 
differ between individuals with different genotype G values. This can be achieved, 
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Table 1. An illustration of SNP genotype probabilities 



Individual 





Genotype 
1 


2 


hard call 


soft call 


1 


0.925 


0.045 


0.030 





0.105 


2 


0.156 


0.102 


0.742 


2 


1.586 


N 


0.375 


0.410 


0.215 


1 


0.840 



for example, by regressing Y on G and other relevant covariates. Hence, the classic 
linear model 

Y j = f i + /3G ] +e j , j = l,2,...,7V, (1) 

with Sj ~ iV(0,<7 2 ), provides a basis for most association analysis or group compar- 
isons. However, results from the robust Kruskal-Wallis test are often obtained as 
preliminary or complimentary evidence. 

In practice, what is available to us may not be the true G group value, but rather 
probabilistic data of G, i.e. the vector of group probabilities Pj = (pij,P2j, • • • ,Pkj), 
where = P(Gj = i) for % — 1, 2, . . . , k and j = 1,2, . . . , N, with J2i=i Pij = 1- 
In genetic association studies of SNPs, depending on the experiment used by each 
application, the genotype of a SNP may be inferred via classification algorithms (e.g. 
Birdseed, (Korn et al., 2008)), imputed via imputation algorithms, (e.g. TUNA 
(Nicolae, 2006), MaCH (Li et al., 2006) and Impute (Marchini et al, 2007)), or 
derived from next generation sequencing calling algorithms (e.g. SYZYGY (Calvo 
et al., 2010) and SNVer (Wei et al., 2011)). In each case, an individual's genotype is 
most likely associated with some level of uncertainty and the probability that the true 
genotype belongs to each of the three genotype groups, p^ = P(Gj = i) for i = 0, 1, 2 
is provided for individual j, as illustrated in Table 1. 

In the presence of genotype group uncertainty, the best-guess approach (also 
known as the hard-call approach) bypasses the problem of incomplete group infor- 
mation by using the most probable group, Gj = {i : p^ = max(p j , py , p 2 j)} , in 
place of Gj in (1). Depending on whether Gj are considered ordinal or categori- 
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cal, a i-test (df = 1) or an ANOVA F-test (df = 2) can be performed, so is the 
Kruskal-Wallis test (df = 2). We refer to these tests as Best-Guess Linear Model 
(BG-LM), Best-Guess ANOVA (BG-ANOVA) and Best-Guess Kruskal-Wallis (BG- 
KW), respectively (Table 2). Although convenient, this best-guess approach fails to 
fully utilize the information in the group probabilities. 

An alternative method is the dosage approach (also known as the expectation- 
substitution or soft-call approach). In this approach, each Gj in (1) is substituted 
by its expectation Gj = pij + 2 x p 2 j. The association evidence is then assessed, for 
example, by a i-test (df = 1) from the regression of Y on G. We refer to this test 
as the dosage test. The main disadvantage of this approach is that it does not allow 
G to be of categorical nature, and it constrains the relationship between Y and G 
to an additive model. There are several other model-based methods that incorporate 
group probabilities (e.g. Marchini et al., 2007; Lin et al., 2008; Kutalik et al., 2010; 
Aulchenko et al., 2010; Schaid et al., 2002), however, model- free counterparts such as 
the Kruskal-Wallis test has not been proposed. 

The remaining paper is organized as follows. In Section 2, we describe the con- 
struction of the generalized Kruskal-Wallis test statistic based on intuitive probability- 
weighted rank-sums, and provide theoretical justifications of its asymptotic chi-square 
distribution. We show that the original Kruskal-Wallis test is a special case of the 
proposed test and discuss how to handle tied observations and possible variations in 
probabilistic data. Section 3 contains simulation studies to evaluate the finite sample 
performance of the proposed test at different levels of group uncertainty. Section 4 
applies the method to data from a genome-wide association study of complications in 
type 1 diabetic patients. Section 5 concludes with additional discussions. Additional 
simulation results are provided in the Supplemental Material. 

2 The Generalized Kruskal-Wallis Test 

2.1 Notation and the Original Kruskal-Wallis Test 

Consider a random sample of size N from a large population consisting of k > 2 
disjoint groups or categories, with population proportions m, i = l,...,k, and 
Y^!i=i 7i"i = 1- Denote by G the categorical variable taking values on Q = {1, 2, . . . , k} 
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and suppose that each category is adequately represented in the sample. Of interest is 
to compare the k groups, of sizes ni, . . . , n*. with Yli=i n i = ^> based on a continuous 
response variable Y. Formally, letting the distribution function of Y over the group 
% be of the form Fi(y) = F(y — 6i), we wish to test 

Hq : Q\ — 02 — • ■ ■ — 9h against Ha '■ not all 8^s are equal. (2) 

When the precise category assignment of G is available, the Kruskal-Wallis test 
for (2) is performed by ranking all the observations together and comparing the sum 
of the ranks for each group. Let rj be the rank of Yj in the overall sample and define 
the indicator variable Zy = l(Gj = i) for the group membership, the Kruskal-Wallis 
test statistic is 

H = — — — -Y — -W+1), (3) 
where rii = J^jLi %v an( i 

N 

Ri = 2^ ZijTj. 
i=i 

Under the null hypothesis of (2), H follows an asymptotic chi-square distribution 
with k — 1 degrees of freedom (Kruskal, 1952; Kruskal and Wallis, 1952). 

2.2 The generalized Kruskal-Wallis Test 

Suppose available to are not G but probabilistic data of G, Pj = (pij,P2j, • • • ,Pkj), 
where pij = P(Gj = i) for i = 1, 2, . . . , k and j = 1,2, ... ,N, with J2i=i Pij = 1- m 
this case, it is intuitive to consider the weighted rank-sum 

N 

for each group as a basis of comparison. 

However, direct replacement of Ri with R* in the original Kruskal-Wallis test 
statistic (3) does not lead to a properly calibrated test statistic. Below we describe the 
construction of the generalized Kruskal-Wallis test, based on this appealing weighted 
rank-sum R*, that has an asymptotic chis-square distribution with k — 1 degrees of 
freedom. 

We make the following assumptions. 
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(Al) p-'s are independent of Y/s. 

(A2) pi = Y,jLiPij/ N — y 7Tj , as N ->■ oo, with < 7^ < 1. 

(A3) E?=i(Pij-Pi) 2 /N > 0, as iV ^ oo. 

(A4) 



EjLifei - Pi)(pi'j - Pi> 



J2j=i(pij - Pi) 2 J2j=i(pi'j-p 



— > pij/ as iV — j- oo. 



The independence assumption (Al) is standard in statistical analyses of explana- 
tory/response data and is reasonable in practice because, for instance, most impu- 
tation algorithms do not consider the phenotype data in the classification of the 
genotype variable (Marchini and Howie, 2010). The assumption (A2) ensures that 
5_)j=i Pij provides a reasonable approximation for rij and that the relative group sizes 
are convergent. Assumptions (A3) and (A4) are required for the (joint) asymptotic 
normality of the linear rank statistic R*. 

If the groups are indeed from an identical population, the r/s can be viewed 
as a random sample drawn without replacement from the first iV integers. Thus, 
E(rj-) = (N+ l)/2, Varfo) = (N 2 - 1)/12 and Covfa, ry) = —(N+ 1)/12, for j ^ f 
under the null hypothesis of (2). The conditional mean and the conditional variance 
of R* given the group probabilities can then be derived as 



, iV + lv^ , Tr „ N(N +1) 

E(R*) =fk= ——Y,Pa and Var(iJJ) = ^ 2 = 1 J J> 



12 



Pi) 5 



respectively. An important distinction between our approach and that of Kruskal 
and Wallis (1952) is that we consider permutations of the first iV integers rather than 
finite-sampling, hence no finite sample correction is required in our derivations. 

2.2.1 The case with two samples 

The following result is due to the asymptotic theory of linear rank statistics (Wald 
and Wolfowitz, 1944; Hajek et al., 1999) and governs our test construction. 
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Theorem 1. Under the null hypothesis of (2) and assumptions (Al)-(AS), the lim- 
iting distribution of 

R*-(N + l)/2 Ef =1 ^' 

— Z 

is standard normal with mean and variance 1. 

Proof. It suffices to show that the sequence (pn,Pi2, ■ ■ ■ ,Pin) satisfies the W condition 
of the Wald-Wolfowitz Theorem (Wald and Wolfowitz, 1944), i.e. 



{N-^xiPa-Pi) 2 } 



_ = 0(1), r = 3,4,.... 



Since < Pij < 1, with pi — > 7Tj, the central sample moments of the sequence 
(j)a,Pi2, ■ ■ ■ ,Pin) are finite: 

1 N 

^E^-ft)^ ^)' forr = 3,4,.... 
i=i 

Assumption (A3) ensures a non-zero variance for (pn,Pi2, ■ ■ ■ ,P%n) and hence the W 
condition holds. For a detailed proof of the asymptotic normality of Ln see, for 
example, Theorem 6.1 of Fraser (1957). □ 

The generalized Kruskal-Wallis test for the two-sample problem takes the form 

where i = 1 or 2. By Theorem 1, H* has an asymptotic x 2 (l) distribution under the 
null hypothesis. Note that it is sufficient to consider only one of the R*'s in H* as 

Rl-(N+ l)/2 Y:? =1 Pi S _ R*2-(N+ l)/2 Ef = iP2 3 
which can be easily verified using py = 1 — p 2 j, for j = 1, . . . , N. 
2.2.2 The case with three samples 

For the three-sample problem, we shall consider the joint distribution of any two R* 
and R*,. The covariance between R* and R*, can be calculated as 

N 

12 
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which yields the correlation 



Pi 



T,f=i(Pij - Pi)(pi'j - p^) 
J2j=i(Pij - Pi) 2 52f=x(Pi'j - Pv 



Theorem 2. Under the null hypothesis of (2) and assumptions (Al) — (AA), the 
limiting distribution of 

R* -(N + l)/2 E?=iPa , r , Rt> ~(N + 1)/2 Ef=iPi'j 
L N = and L N = - 

0~i Oil 

is bivariate normal with means 0, variance 1 and correlation pu> . 

Proof. Since the sequences (pn,Pi2, ■ ■ ■ ,Pin) an d (Pi>i,Pi'2, ■ ■ ■ iPi'n) satisfy the W 
condition, the result directly follows from Theorem 6.3 of Fraser (1957). □ 

Similar to Kruskal and Wallis (1952), the generalized Kruskal-Wallis test can be 
formed by considering the exponent of the bivariate normal distribution of Theorem 
2, multiplied by -2, 



H* 



I -Pi' 
—2pn> 



R* - (N + l)/2 Ef=iPiiY , (Rl> - (N + l)/2 Ef =1 P^ 



+ 

0~i J \ Oil 

R*-(N+ l)/2 J2?=iPiA f R*,-(N+l)/ 2 Z*L lPvj 

Oi \ Oil 



, (6) 



which is asymptotically distributed as y 2 {2) under the null hypothesis. 

An algebraic simplification of H* is difficult because of the Pi/s. However, as in 
the two-sample case, the left-out R\, £ ^ would not be informative once R* and 
R*, are taken into account. 

2.2.3 The case with more than three samples 

Theorem 2 states that the joint distribution of any two linear rank statistic is asymp- 
totically bivariate normal. This result together with the Cramer-Wold device yields 
the joint asymptotic multivariate normality of any k — 1 linear rank statistic. Thus, 
the case with k > 3 is handled by considering the correlation matrix for an arbitrary 
k — 1 of the .Rj's in a similar fashion. The test statistic is formed by the exponent 
of the (k — l)-variate normal distribution after multiplication by —2 and follows an 
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asymptotic ^ik — 1) distribution under the null hypothesis. An R-code that im- 
plements the generalized Kruskal-Wallis test for any k > 2 is available at author's 
website. 

2.3 Further Considerations 

2.3.1 H-test as a special case 

The generalized Kruskal-Wallis test reduces to the original Kruskal-Wallis test when 
Gj is known for each subject. In this case, we define p^ = (Z\j, Z 2 j, . . . , Z^j). Then, 
R* = Ri and it is easy to show that 

N + 1 n 
fa = — g— rk = E(i2(), d] = — (N + 1)(N - n) = Var( J R i ), 

and that 

p iV = f^-^j (j^) = Cor( Rl , R,), for i + if. 
Hence, i7-test is a special case of H*. 

Remark. A major advantage of the generalized Kruskal-Wallis test is that it does 
not require a separate treatment for partially uncertain categorial data. In line with 
the above approach, when Gj is available, the ranks are allocated to the corresponding 
groups with probability 1, i.e. Pij = (Z^, Z 2 j, . . . , Z^j) and when there is uncertainty 
in the group membership, the probability-weighing principle in (4) guard the testing 
procedure against possible mis classifications. 

2.3.2 Correction for ties 

The generalized Kruskal-Wallis test can be corrected for ties in the same way as the 
original Kruskal-Wallis test. When ties occur in the data, one typically assigns the 
mean of the tied ranks to each member of the tied group. This specification, while 
not affecting the mean rank, reduces the population variance by ^T/(12A r ) and 
increases the population covariance by ^T/{12A r (A r — 1)}, where T = it — 1)£(£ + 1) 
for each group of ties, t denotes the number of ties in the group, and the summation 
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is taken over all groups. Thus, in the case of ties, the variance of R* decreases by 



The corrected generalized Kruskal-Wallis test for ties is hence obtained by sub- 
stituting the adjusted <7j and pa* in H* . Note that, in many situations the difference 
between the generalized Kruskal-Wallis test and its tie-corrected version would be 
negligible barring an excessive number of ties. The R-code provided at author's web- 
site accommodates tie- correct ion in the generalized Kruskal-Wallis test. 

2.3.3 Exact distribution 

The exact null distribution of the Kruskal-Wallis test for three samples, each with up 
to five observations, is given in Kruskal and Wallis (1952). More extensive tables are 
later provided by Iman et al. (1975) for up to eight observations in each sample. With 
a moderately large number of observations, the exact probability calculations of the 
H statistic become cumbersome, even with modern computing power. In the case 
of the generalized Kruskal-Wallis test, similar tabulations, even for small samples, 
seem intractable due to probability weights and remain an open problem. Similar 
to the suggestion of Kruskal and Wallis (1952), we recommend using the chi-square 
approximation when the Ylf=iPi/ s are a ^ l eas t five. 

2.3.4 Variation in group probabilities 

The proposed generalized Kruskal-Wallis test is conditional on observed data. There- 
fore, in our proposal, we treat the p-'s as fixed quantities that define the underlying 
probability distribution of the unknown Gj's. The treatment is reasonable for genetic 
applications as there is little variation in the genotype probabilities if the same al- 
gorithm is employed more than once provided the input data are the same (e.g. Pei 
et al, 2008). 




Similarly, for i ^ i', the covariance between R* and R*, is reduced by 



N 



12(N- 1) 



^2(Pij -Pi)(Pi>j -Pi')- 
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In an unconditional inference, the variation in the group probabilities due to 
random sampling needs to be taken into account. This amounts to specifying a 
probability distribution for the Pi/s, which may not be feasible in practice. Here, 
we briefly discuss the validity of the proposed test when the group probabilities are 
random but treated as fixed, focusing on the case k = 2. For the distribution of the 
Pi/s, while one may consider, for instance, a mixture of two beta distributions, we 
prefer to keep our argument as general as possible. 

Suppose the vector of probabilities {pa, . . . ,Pin} are drawn independently from 
a probability distribution with mean and variance satisfying conditions analogous 
to (A1)-(A3). Let /i* and a* 2 denote the unconditional mean and variance of R*, 
respectively. It is easy to see that 

iV + 1 N 
3=1 

and, by variance decomposition, we obtain 

N / , T 1 x 2 N 



*2 



N{N + l) ^ . 21 /JV+1VA r , , 



12 ^ 

3=1 v ' 3=1 



= E^+Var^), 

where /2j and of remain the conditional mean and the conditional variance of R*. 

The statistic in H*, normalized according to the conditional quantities, can be 
written as 

~a °i V < < J ' 1 ' 

The first quantity in the parenthesis can be shown to be asymptotically standard 
normal under certain conditions similar to (Al)-(A3), and the second quantity has 
an approximate normal distribution with mean zero and variance Var(/ij) /V* 2 , by the 
central limit theorem. Note that 

{N + 1\ 2 N 
CoviR*,^) = y 2 J Yl Var fei) = Var (/^) 



and hence 



3=1 



HLJh A at f o, e < s - 



a* 2 
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Using Slutsky's theorem, we can obtain [At — jlA jdi — > A^(0,1), provided that 
of — — >■ E(of ). The latter condition is satisfied when Vai{(pij — TTi) 2 } = r 2 > 0. 

Thus, under certain assumptions, the generalized Kruskal-Wallis test statistic 
remains valid when group probabilities are random. The application in Section 4 
provides further evidence of this conclusion. 

3 Simulations 

Here we evaluate the methods via simulation studies. Specifically, (i) we evaluate the 
validity of the asymptotic null distribution of the generalized Kruskal-Wallis test in 
finite samples, and (ii) we compare the finite sample performance of the generalized 
Kruskal-Wallis test with those of commonly used parametric tests as summarized in 
Table 2. We focus on the dosage approach as it is the most popular method used 
in practice, and previous work that compared various parametric approaches also 
recommended its usage (Zheng et al., 2011). 

Table 2. Summary of the association tests compared 



Test 



df Incorporate Robust to 

Group Uncertainty Model Assumptions 



I ; (poj,Pij,P2j) is used to determine the most likely genotype group 



BG-LM 

BG-ANOVA 

BG-KW 

Dosage 
GKW 



Best-Guess Linear Model 1 
Best-Guess ANOVA 2 
Best-Guess Kruskal-Wallis 2 



No 
No 
No 



II: (poj , pij , P2j ■) is used directly in the test 



Dosage 1 
Generalized Kruskal-Wallis 2 



Yes 
Yes 



No 
No 
Yes 

No 
Yes 



3.1 Simulation Methods 

As a proof of principle and being consistent with the motivation of this work, we 
simulated genetic association data. Phenotype and SNP genotype data for n — 1, 000 
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individuals were generated as follows. 

The three genotype groups were coded as G = 0, 1 and 2 copies of the minor allele 
of a SNP. The SNP of interest had a minor allele frequency of 20%, leading to expected 
group size of (n , n\, n 2 ) = (640,320,40) based on the multinomial distribution with 
parameters (0.64, 0.32, 0.04) under the Hardy-Weinberg equilibrium assumption. We 
also considered minor allele frequency of 10% and other values. 

The phenotype data were generated from an additive normal model, favorable 
to the parametric methods. Without loss of generality, Y values were simulated 
from normal distribution with equal mean of (2,2,2) under the null model, and 
(1.75,2,2.25) under the alternative model for the three genotype groups G — 0, 1 
and 2, respectively, with a common variance, a 2 = 1. The mean values were cho- 
sen such that power (at the a = 0.01 level) to detect association between Y and 
G is about 95% for minor allele frequency of 20% (and about 70% for minor allele 
frequency of 10%) with the given sample size and without genotype uncertainty. 

Other simulating parameter values were also considered with varying minor allele 
frequencies, type 1 error rates and sample sizes, as well as non-normal or non-additive 
models (Table SI). Additional results are provided in the Supplemental Material and 
conclusions are characteristically similar to the ones reported here. 

Given a true genotype G, to simulate the probabilistic genotype data, we used 
the Dirichlet distribution with scale parameters a for the correct genotype category 
and (1 — a)/2 for the other two, where a — 1, 0.9, 0.8, and 0.7, corresponding 
to an increasing level of group uncertainty ranging from 0% to 30%. Under this 
Dirichlet simulating model, the proportion of the individuals whose most probable 
(best-guessed) genotypes are the correct ones is approximately a (Table 3). 

For each set of probabilistic data, M = 10, 000 experiments under the null model 
and 5, 000 experiments under the alternative model were conducted by simulating 
only the response data. Application in Section 4 confirms that methods comparison 
is not affected by how the probabilistic data were generated. 

3.2 Evaluation of Accuracy 

Table 4 provides the empirical type 1 error rates of the five tests considered. The 
group uncertainty does not seem to alter the accuracy of any of the tests in an obvious 
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Table 3. The empirical proportion of the individuals whose most probable (best- 
guessed) genotypes are the correct ones, a is the Dirichlet parameter for the correct 
genotype category. SNP has a minor allele frequency of 20%. Results for other frequen- 
cies are similar. 



a 


average 


G = 


G = 1 


G = 2 


1 


1.00 


1.00 


1.00 


1.00 


0.9 


0.93 


0.92 


0.96 


0.91 


0.8 


0.83 


0.82 


0.85 


0.86 


0.7 


0.74 


0.74 


0.74 


0.80 



way. For the proposed GKW test, we also compare the quantiles of the empirical 
distributions with those of the x 2 (2) distribution. Figures SI and S2 indicate that 
the empirical null distribution coincides with the asymptotic one, which is further 
supported by the Kolmogorov-Smirnov test with p-values 0.279, 0.595, 0.628 and 
0.599 for SNP with minor allele frequency of 20% and 0.174, 0.377, 0.375 and 0.194 
for SNP with minor allele frequency of 10%, from the lowest to the highest uncertainty 
levels. 

Table 4. Empirical type 1 error rates of the five tests at a = 0.01 under a normal null 
model, for testing the association of a SNP that has minor allele frequency of 20% or 
10%. a is the parameter value of the Dirichlet distribution used to simulate genotype 
probabilities. 



minor allele frequency = 0.2 minor allele frequency = 0.1 



uncertainty level 


0% 


10 % 


20 % 


30 % 


0% 


10 % 


20 % 


30 % 




(a=l) 


(a = 0.9) 


(a = 0.8) 


(a = 0.7) 


(a = l) 


(a = 0.9) 


(a = 0.8) 


(a = 0.7) 


BG-LM 


0.0109 


0.0108 


0.0090 


0.0096 


0.0093 


0.0091 


0.0095 


0.0105 


BG-ANOVA 


0.0098 


0.0106 


0.0109 


0.0094 


0.0083 


0.0089 


0.0103 


0.0102 


BG-KW 


0.0094 


0.0091 


0.0100 


0.0097 


0.0075 


0.0096 


0.0098 


0.0094 


Dosage 


0.0109 


0.0105 


0.0091 


0.0099 


0.0093 


0.0092 


0.0095 


0.0095 


GKW 


0.0094 


0.0087 


0.0091 


0.0088 


0.0075 


0.0088 


0.0092 


0.0095 
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3.3 Evaluation of Efficiency 

We examine the empirical relative efficiency of the other tests compared to the pro- 
posed generalized Kruskal-Wallis test under the alternative model, using the normal 
additive data favorable to the model-based tests (Figure 1). The empirical power 
of each test was obtained using the corresponding empirical type 1 error threshold 
reported in Table 4. Note that, when the true genotypes are used (i.e. a = 1 with 
0% group uncertainty), the GKW and the BG-KW are equivalent. This is also true 
for the dosage test and the BG-LM. 

As expected, the dosage has the best power when there is no group uncertainty 
(Figure 1 at 0% uncertainty level), because the data were generated under the best 
scenario for the dosage test (i.e. phenotype Y was normally distributed with pop- 
ulation means, (1.75,2,2.25), increasing in an additive manner with respect to the 
number of copies of the minor allele, G = (0, 1, 2)). When the minor allele frequency 
is 20%, the power of the dosage test remains (slightly) higher than the generalized 
Kruskal-Wallis test even as the uncertainty level increases (Figure 1(a)). However, 
this is no longer true for detecting SNPs with minor allele frequency of 10% (Fig- 
ure 1(b)). For example, with uncertainty level at 30% (a = 0.7), the relative efficiency 
of all other tests including the dosage test is about 60% as compared to the generalized 
Kruskal-Wallis test. 

Moreover, in less favorable models, (e.g. Figure S7 for non- normal model), the gen- 
eralized Kruskal-Wallis test can outperform the others even when there is no genotype 
uncertainty. Additional simulation results with different minor allele frequencies (e.g. 
0.05 and 0.3), different type 1 error rate (e.g. 0.05 and 0.001) and different model 
assumptions (e.g. non-additive model) as presented in Figures S4-S8 all confirm the 
robustness of the generalized Kruskal-Wallis test. Under scenarios favorable to model- 
based methods, the generalized Kruskal-Wallis test provides comparable power; with 
model misspecification or increased genotype uncertainty, the generalized Kruskal- 
Wallis test can outperform the others. 
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(a) (b) 

Figure 1. Relative efficiency of other tests as compared to the GKW test, under a 
normal additive model, for testing the association of a SNP that has (a) minor allele 
frequency of 20%, or (b) minor allele frequency of 10%. 

4 Data Applications 

We demonstrate the proposed method in three applications using data from the 
genome-wide association study of complications in type 1 diabetes patients, for which 
over 800K SNPs were genotyped by the Illumina 1M beadchip assay and over 1.5 mil- 
lion ungeno typed SNPs were imputed (Paterson et al., 2010). 

The study sample consists of n = 1, 300 subjects with type 1 diabetes (664 treated 
conventionally and 636 treated intensively) from the Diabetes Control and Complica- 
tions Trial (DCCT). The phenotypes of interest are glycosylated hemoglobin (HbAlc) 
and diastolic blood pressure (DBP), collected quarterly from each patient over the 
course of the DCCT. 

The first application illustrates the case of no association using 27, 265 ungeno- 
typed but imputed SNPs on chromosome 22. The other two applications evaluate the 
performance of the generalized Kruskal-Wallis test in detecting putative associations 
of two genotyped SNPs with DBP and HbAlc. 
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4.1 P- value Distribution of H* on Chromosome 22 



We investigated the p-value distribution of the generalized Kruskal-Wallis test using 
chromosome 22 data under the null hypothesis of no association. The genotype data 
at 33, 815 SNPs were imputed (provided by Dr. Andrew Paterson's research group) 
using MaCH (Li et al., 2006, 2009) using HapMap II phased data as the reference 
panel. In the association analysis, we considered 27, 265 SNPs that yielded the sum 
of genotype probabilities of at least five for each of the three genotype groups. 

The phenotype data, mean DBP measurements over the first six study periods, 
were first permuted to eliminate any possible associations. We observed that the 
null distribution of p- values obtained from the x 2 (2) approximation of the general- 
ized Kruskal-Wallis test statistic closely matches the theoretical uniform distribution. 
Figure 2 displays the corresponding quantile-quantile plot on the — log 10 p scale. The 
Kolmogorov-Smirnov test for uniformity yields p-value= 0.084, providing additional 
evidence for the validity of the proposed generalized Kruskal-Wallis test in finite sam- 
ples, because in contrast to the simulations, the group probabilities are random in 
that their distributions vary between imputed SNPs and among the patient subjects. 




1 2 3 4 5 

Expected -logi (p- value) 

Figure 2. The quantile-quantile plot of the — log w (p- values) of the GKW test applied 
to assess association between 27,265 imputed SNPs on chromosome 22 and permuted 
diastolic blood pressure phenotype measures in 1,300 subjects with type 1 diabetes. 
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4.2 DBP with rs7842868 



Recently, Ye et al. (2010) identified rs7842868 on chromosome 8 as a SNP associated 
with DBP with p-value ~ 4.5 x 10~ 8 . Here, we examined the performance of the five 
tests in detecting this association. Since GKW and BG-KW are readily advantaged 
in the case of non-normal data, we base our comparisons on the natural logarithm of 
the DBP measurements averaged over the first six study periods. The histogram of 
the phenotype data for the 1, 300 patients is given in Figure 3(a). 

The observed genotype data at rs7842868 yielded the group sizes (no,ni,n 2 ) = 
(788,446,66). When tested for association, rs7842868 was found to be significant by 
all tests with similar results (Figure 3(a) at 0% uncertainty level). GKW (equivalent 
to BG-KW when genotypes are known) had a marginally higher statistical significance 
with p-value = 1.36 x 10^ 8 , followed by the dosage test and BG-LM with p-value = 
2.66 x 10~ 8 . 

We then masked the actual data and simulated 1, 000 replicates of in silico geno- 
types (i.e. genotype probabilities) from the Dirichlet distribution at each group uncer- 
tainty level, as described in Section 3. The results averaged over the 1, 000 replications 
are shown in Figure 3(a) and verify the robustness of the GKW test. For instance, 
at 10% or higher uncertainty level, the proposed GKW test had noticeable better 
performance than all other procedures. Nevertheless, power to detect association by 
any method deteriorated considerably as genotype uncertainty increased. 

4.3 HbAlc with rsl358030 

The SNP rsl358030 on chromosome 10 was reported to be genome-wide significantly 
associated with HbAlc (p- value = 5 x 10 -9 ) in the conventionally treated group (Pa- 
terson et al., 2010). We performed association analyses using the observed genotypes 
and masked probabilistic genotypes in a fashion similar to the above DBP application. 

The histogram of the phenotype, average log(HbAlc) measurements over the first 
six study periods, is given in Figure 3(b). for the n = 664 conventionally treated 
patients. The genotype group sizes at rsl358030 were (no,ni,n 2 ) = (267,307,90). In 
this application, when the actual genotypes were used, the dosage test (equivalent 
to BG-LM) showed the best performance in detecting the HbAlc association (Figure 
3(b) at 0% uncertainty level). However, the advantage of the dosage test dissipated 
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Figure 3. The significance, on the — log 10 (p- values) scale, of the association tests 
at different genotype uncertainty levels for (a) DBP with rs7842868, (b) HbAlc with 
rsl358030. 



as the genotype uncertainty level increased. 



5 Conclusions and Disucssions 

In this paper, we generalized the rank-based nonparametric Kruskal-Wallis test to al- 
low for group uncertainty when comparing k samples. The proposed generalized test 
statistic follows an asymptotic chi-square distribution with k — 1 degrees of freedom, 
suitable for statistical inference of large-scale data (e.g. genome-wide association or 
next-generation sequencing data) without the need for permutation or other compu- 
tationally inefficient procedures. 

Although the work was originally motivated by the analyses of SNPs with genotype 
uncertainty in genetic association studies, it can be readily applied to other scientific 
studies. Extensive simulations and several applications showed that the generalized 
Kruskal-Wallis test provides a good balance between robustness and power. Our 
proof-of-principle stimulation studies could be improved to more closely mimic real 
genetic data and models. However, the validity and robustness conclusion character- 
istically holds, given the combined evidence from our theoretical work, and simulation 
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and application studies. 

The proposed generalized Kruskal-Wallis test has its limitations. For example, the 
exact distribution for small samples (e.g. relevant to genetic association studies of rare 
variants) is unknown. The current test does not include other covariates. However, 
this limitation could be partially alleviated by considering the residuals from a regres- 
sion model that accounts for the effects of other covariates first, assuming there is no 
SNP-covariate interaction. Both the original and the generalized Kruskal-Wallis tests 
are formulated for continuous outcomes, however, the probability-weighting principle 
exploited here could potentially be extended to analyze case-control data or other 
categorical outcomes and is the subject of ongoing research. 

In conclusion, the proposed generalized Kruskal-Wallis test provides scientists 
a tool to continue investigate, in a robust non-parametric fashion, the /c-sample 
problems in the presence of group uncertainty. The generalized Kruskal-Wallis test 
includes the original Kruskal-Wallis test as a special case, and its power is com- 
parable to parametric counterparts under conditions favorable to the model-based 
approaches. When there is model misspecification or high group uncertainty, the 
generalized Kruskal-Wallis test can outperform the others. 
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Supplemental Material 



This supplement contains the results of additional simulations under various scenarios, 
as well as the tables and figures cited in the main text. Table SI summarizes the study 
purpose and the settings of our simulation studies where sample size is 1,000. Other 
sample sizes (e.g. 500 or 2,000) were also studied, but results were categorically 
similar, therefore not reported here. 

Table SI. Reader Guide for the simulation results. 
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Figure SI. The quantile-quantile plots for the GKW test statistic, under a normal 
null model, for testing the association of a SNP that has minor allele frequency of 20%. 
The Kolmogorov-Smirnov test has p-values of 0.279, 0.595, 0.628 and 0.599, for the 
lowest to the highest uncertainty levels. 
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Figure S2. The quantile-quantile plots for the GKW test statistic, under a normal 
null model, for testing the association of a SNP that has minor allele frequency of 10%. 
The Kolmogorov-Smirnov test has p- values of 0.174, 0.377, 0.375 and 0.194, for the 
lowest to the highest uncertainty levels. 
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Figure S3. Histograms of (a) mean of the natural logarithm of DBP measurements 
of 1,300 patients over the first six study periods, (b) mean of the natural logarithm 
of HbAlc measurements of 664 conventionally treated patients over the first six study 
periods. 
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Figure S4. Relative efficiency of other tests as compared to the GKW test at a = 0.01, 
under a normal additive model, for testing the association of SNP that has (a) minor 
allele frequency of 5%, (b) minor allele frequency of 30%. 
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Figure S5. Relative efficiency of other tests as compared to the GKW test at (a) 
a = 0.05, (b) q = 0.001, under a normal additive model, for testing the association of 
SNP that has minor allele frequency of 20%. 
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Figure S6. Relative efficiency of other tests as compared to the GKW test at (a) 
a = 0.05, (b) q = 0.001, under a normal additive model, for testing the association of 
SNP that has minor allele frequency of 10%. 
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Figure S7. Relative efficiency of other tests as compared to the GKW test at a = 0.01 
under a non-normal additive model, for testing the association of SNP that has (a) 
minor allele frequency of 20%. (b) minor allele frequency of 10%. The non- normal data 
were obtained by taking the exponent of the normal data generated as in Section 3. 



30 



BG-LM 
BG-ANOVA 
BG-KW 
Dosage 
•- GKW 




uncertainty level 



(a) 



o 
o 

II 
a 



o 
3= 
LU 

<D 
> 

<D 




uncertainty level 



(b) 



Figure S8. Relative efficiency of other tests as compared to the GKW test at a = 0.01 
under a normal non-additive model, for testing the association of SNP that has (a) 
minor allele frequency of 20%. (b) minor allele frequency of 10%. The data under 
non-additive model were generated from normal distribution with means (1.75, 2.25, 2) 
for the three genotype groups G = 0, 1 and 2, respectively, with a common variance, 
a 2 = l. 
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